Link for dataset: https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers
This dataset tracks customers who are leaving a credit card service. Our job is to investigate the data and try to predict which customers are going to churn, so the bank can contact them proactively and prevent the cancellation.
# Importing libraries
import pandas as pd
import plotly.express as px
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import shap
from ydata_profiling import ProfileReport
from matplotlib import gridspec
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
# Read the data
df = pd.read_csv("BankChurners.csv")
# Listing all the column names:
for column_header in df.columns:
    print(column_header)
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
df.head(3)
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 | Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 0.000093 | 0.99991 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | 0.000057 | 0.99994 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | 0.000021 | 0.99998 |
3 rows × 23 columns
We will drop the CLIENTNUM column (just a customer identifier) and the last two precomputed Naive Bayes columns
columns = ["CLIENTNUM","Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
"Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2"]
df.drop(columns=columns, inplace=True)
df.head(3)
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
# Seeing the data types
df.dtypes
Attrition_Flag object Customer_Age int64 Gender object Dependent_count int64 Education_Level object Marital_Status object Income_Category object Card_Category object Months_on_book int64 Total_Relationship_Count int64 Months_Inactive_12_mon int64 Contacts_Count_12_mon int64 Credit_Limit float64 Total_Revolving_Bal int64 Avg_Open_To_Buy float64 Total_Amt_Chng_Q4_Q1 float64 Total_Trans_Amt int64 Total_Trans_Ct int64 Total_Ct_Chng_Q4_Q1 float64 Avg_Utilization_Ratio float64 dtype: object
# Creating a profile for automated report
profile = ProfileReport(df, title="Credit_Card_Churn_Report")
# Exporting to a file
profile.to_file("Credit_Card_Churn_Report.html")
# Checking null values
df.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
None of the features have missing values
# Generating a descriptive statistics about the data
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.0 | 2.346203 | 1.298908 | 0.0 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
In general the features have a mean close to the median, but for "Credit_Limit" and "Avg_Open_To_Buy" the mean sits well above the median, suggesting right-skewed distributions. We will investigate this behavior through plots
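Before plotting, the skew suggested by the mean/median gap can be confirmed numerically. A minimal sketch with a small right-skewed series standing in for a column like "Credit_Limit" (the same calls apply to `df["Credit_Limit"]` directly):

```python
import pandas as pd

# Toy right-skewed values: a few large outliers pull the mean above the median
s = pd.Series([1438.0, 2555.0, 4549.0, 11067.0, 34516.0])

print(s.mean() > s.median())  # right skew pulls the mean above the median
print(s.skew())               # positive sample skewness indicates right skew
```

A clearly positive `skew()` would back up what the describe() table hints at.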
labels = df.Attrition_Flag.value_counts().index
sizes = df.Attrition_Flag.value_counts()
explode = (0, 0.1)  # offset the smaller (attrited) slice
fig1, ax1 = plt.subplots(figsize=(12, 6))
ax1.pie(sizes, explode=explode, labels=labels, autopct="%1.1f%%",
        shadow=True, startangle=90)
plt.title("Proportion of customers", size=20)
We can see that the majority of the data (83.9%) refers to existing customers (people who still hold an account with the bank), and only 16.1% refers to customers who cancelled their account
fig = px.histogram(df,
"Attrition_Flag",
color="Attrition_Flag",
hover_name="Attrition_Flag",
)
fig.show()
fig = px.histogram(df,
"Education_Level",
color="Attrition_Flag",
hover_name="Attrition_Flag",
).update_xaxes(categoryorder="total descending")
fig.show()
Most customers have a Graduate education level, while Doctorate is the least common. Attrited customers appear at every level, with more attrition in absolute numbers among Graduates than among Doctorates
fig = px.histogram(df,
x="Gender",
color="Attrition_Flag",
hover_name="Attrition_Flag",
).update_xaxes(categoryorder="total descending")
fig.show()
There are more females than males in the dataset, and women are the majority among both existing and attrited customers
fig = px.histogram(df,
x="Card_Category",
color="Attrition_Flag",
hover_name="Attrition_Flag",
).update_xaxes(categoryorder="total descending")
fig.show()
The great majority of customers hold a Blue card and only a few hold Platinum. However, attrited customers are present in all four card categories
px.histogram(df,
x="Dependent_count",
color="Attrition_Flag",
hover_name="Attrition_Flag",
)
Most people have 2 or 3 dependents, and the same pattern holds among the customers who cancelled their account
sns.displot(df,
x="Customer_Age",
kind="kde",
hue="Attrition_Flag"
)
From the plot we can see that there are more existing customers than attrited ones at every age. The two groups follow the same distribution, peaking between 40 and 50 years old
sns.displot(df,
x="Credit_Limit",
kind="kde",
hue="Attrition_Flag"
)
Existing customers and customers who cancelled their account show the same pattern: the density peaks are concentrated near the extreme values of the credit limit
sns.displot(df,
x="Avg_Utilization_Ratio",
kind="kde",
hue="Attrition_Flag"
)
# Select the numerical features, excluding the categorical ones
num_df = df.select_dtypes(exclude=["object"])
num_df.head(3)
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 3 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 49 | 5 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 51 | 3 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
# Check Pearson's correlation of the numerical features
corr_matrix = num_df.corr()
fig, ax = plt.subplots(figsize=(12, 8))
heatmap = sns.heatmap(
    corr_matrix,
    cmap="Wistia",
    annot=True,
    fmt=".2f"
)
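The heatmap shows some strongly related columns (e.g. "Credit_Limit" and "Avg_Open_To_Buy", which differ only by the revolving balance). A small sketch of how the strongest pairs can be listed programmatically, here on toy data so it is self-contained; the same idiom applies to the real `corr_matrix`:

```python
import numpy as np
import pandas as pd

# Toy frame: "y" is almost a copy of "x", "z" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"x": a,
                    "y": a + rng.normal(scale=0.1, size=200),
                    "z": rng.normal(size=200)})
corr = toy.corr()

# Keep only the upper triangle (each pair once), then rank by |correlation|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
print(pairs.head())  # the ("x", "y") pair should dominate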
Our target feature will be the "Attrition_Flag" column. So, we will drop that feature from the rest
X = df.drop(columns=["Attrition_Flag"])
X.head(3)
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
Now we will split our X dataframe in categorical and numerical data. After that, we will use the "One Hot Encoding" approach to get rid of the categorical values through the "pd.get_dummies" method
X_cat = X.select_dtypes(include=["object", "bool"]).columns
X_cat
Index(['Gender', 'Education_Level', 'Marital_Status', 'Income_Category',
'Card_Category'],
dtype='object')
X_num = X.select_dtypes(include=["int64", "float64"]).columns
X_num
Index(['Customer_Age', 'Dependent_count', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
dtype='object')
# Numerical DataFrame
X_num_df = X[X_num]
X_num_df.head(3)
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 3 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 49 | 5 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 51 | 3 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
# Categorical DataFrame
X_cat_df = X[X_cat]
X_cat_df.head(3)
| Gender | Education_Level | Marital_Status | Income_Category | Card_Category | |
|---|---|---|---|---|---|
| 0 | M | High School | Married | $60K - $80K | Blue |
| 1 | F | Graduate | Single | Less than $40K | Blue |
| 2 | M | Graduate | Married | $80K - $120K | Blue |
# One Hot Encoding approach
X_cat_one_hot_encoded = pd.get_dummies(X_cat_df)
X_cat_one_hot_encoded.head(3)
| Gender_F | Gender_M | Education_Level_College | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Education_Level_Unknown | Marital_Status_Divorced | ... | Income_Category_$120K + | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Income_Category_Unknown | Card_Category_Blue | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
3 rows × 23 columns
The "One Hot Encoding" approach creates a new column feature for every categorical data getting rid of the "names" values
# Recreating the X Dataframe
X_encoded = pd.concat([X_num_df, X_cat_one_hot_encoded], axis=1, join="inner")
X_encoded.head(3)
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | ... | Income_Category_$120K + | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Income_Category_Unknown | Card_Category_Blue | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 3 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 49 | 5 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 51 | 3 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
3 rows × 37 columns
# Target Feature
y = df["Attrition_Flag"]
y.head(3)
0 Existing Customer 1 Existing Customer 2 Existing Customer Name: Attrition_Flag, dtype: object
# Now transforming our target feature
# 0 means a person with an bank account
# 1 means a person that cancelled the bank account
df_transformado = df
df_transformado["Churn"] = np.where(df_transformado["Attrition_Flag"] == "Attrited Customer", 1, 0)
y = df_transformado["Churn"]
y.head(3)
0 0 1 0 2 0 Name: Churn, dtype: int32
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import lightgbm as lgb
from tqdm import tqdm
Now we will obtain metrics for 3 differents kinds of models
# Pré-alocando
tree_precision = []
tree_accuracy = []
tree_recall = []
tree_f1 = []
for i in tqdm(range(100)):
X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,
test_size=0.33,
stratify=y)
tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(X_train, y_train)
# Metrics
precision = metrics.precision_score(y_test, tree_classifier.predict(X_test))
accuracy = metrics.accuracy_score(y_test, tree_classifier.predict(X_test))
recall = metrics.recall_score(y_test, tree_classifier.predict(X_test))
f1 = metrics.f1_score(y_test, tree_classifier.predict(X_test))
# Append values
tree_precision.append(precision)
tree_accuracy.append(accuracy)
tree_recall.append(recall)
tree_f1.append(f1)
100%|██████████| 100/100 [00:18<00:00, 5.28it/s]
# Pré-alocando
forest_precision = []
forest_accuracy = []
forest_recall = []
forest_f1 = []
for i in tqdm(range(100)):
X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,
test_size=0.33,
stratify=y)
forest_classifier = RandomForestClassifier()
forest_classifier.fit(X_train, y_train)
# Metrics
precision = metrics.precision_score(y_test, forest_classifier.predict(X_test))
accuracy = metrics.accuracy_score(y_test, forest_classifier.predict(X_test))
recall = metrics.recall_score(y_test, forest_classifier.predict(X_test))
f1 = metrics.f1_score(y_test, forest_classifier.predict(X_test))
# Append values
forest_precision.append(precision)
forest_accuracy.append(accuracy)
forest_recall.append(recall)
forest_f1.append(f1)
100%|██████████| 100/100 [02:35<00:00, 1.55s/it]
# Pré-alocando
lgb_precision = []
lgb_accuracy = []
lgb_recall = []
lgb_f1 = []
for i in tqdm(range(100)):
X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,
test_size=0.33,
stratify=y)
lgb_classifier = lgb.LGBMClassifier()
lgb_classifier.fit(X_train, y_train)
# Metrics
precision = metrics.precision_score(y_test, lgb_classifier.predict(X_test))
accuracy = metrics.accuracy_score(y_test, lgb_classifier.predict(X_test))
recall = metrics.recall_score(y_test, lgb_classifier.predict(X_test))
f1 = metrics.f1_score(y_test, lgb_classifier.predict(X_test))
# Append values
lgb_precision.append(precision)
lgb_accuracy.append(accuracy)
lgb_recall.append(recall)
lgb_f1.append(f1)
100%|██████████| 100/100 [00:31<00:00, 3.19it/s]
# Preparing the data results to plot
precision_comparison = pd.DataFrame({"DecisionTree":tree_precision,
"RandomForest":forest_precision,
"LGBM":lgb_precision})
accuracy_comparison = pd.DataFrame({"DecisionTree":tree_accuracy,
"RandomForest":forest_accuracy,
"LGBM":lgb_accuracy})
recall_comparison = pd.DataFrame({"DecisionTree":tree_recall,
"RandomForest":forest_recall,
"LGBM":lgb_recall})
f1_comparison = pd.DataFrame({"DecisionTree":tree_f1,
"RandomForest":forest_f1,
"LGBM":lgb_f1})
Precision talks about how precise the model is. So, out of the predicted positive, this metric shows how many of them are actual positive
fig = plt.figure(figsize = (10,6))
sns.boxplot(data=precision_comparison).set(title="Precision Score")
[Text(0.5, 1.0, 'Precision Score')]
Accuracy generally describes the performance of the model across all classes. Basically, it is the fraction of predictions that our model got right
fig = plt.figure(figsize = (10,6))
sns.boxplot(data=accuracy_comparison).set(title="Accuracy Score")
[Text(0.5, 1.0, 'Accuracy Score')]
The recall metric quantifies the number of correct positive predictions that ou model made out of all positive predictions that could possibily been made
fig = plt.figure(figsize = (10,6))
sns.boxplot(data=recall_comparison).set(title="Recall Score")
[Text(0.5, 1.0, 'Recall Score')]
The F-Score gives us a single score that balances both the precision and recall
fig = plt.figure(figsize = (12,6))
b = sns.boxplot(data=f1_comparison).set(title="F1 Score")
We can see from the plots, principally "Precision" and "F1 Score" that the LGBM Classifier is the best classifier for our data
X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,
test_size=0.33,
stratify=y)
lgb_classifier = lgb.LGBMClassifier()
lgb_classifier.fit(X_train, y_train)
LGBMClassifier()
explainer_lgbm = shap.Explainer(lgb_classifier.predict, X_test)
shap_values_lgbm = explainer_lgbm(X_test)
Permutation explainer: 3343it [09:16, 5.93it/s]
shap.plots.waterfall(shap_values_lgbm[0])
shap.plots.beeswarm(shap_values_lgbm)